A: GMDH stands for Group Method of Data Handling. It is a statistical learning network technology based on the cybernetic approach of self-organization, drawing on systems theory, information theory, control theory, and computer science. GMDH is not a traditional statistical modeling method; it is an interdisciplinary approach designed to overcome some of the main disadvantages of statistics and neural networks (NN's).
You can find a short intro in Paper2 (section Self-organizing modeling technologies) on the KnowledgeMiner web site. You may also want to look at the publications area for more information.
Q: What kind of models does KnowledgeMiner create? I'm not a statistician and need an overview of the possibilities so I can apply it to my field.
A: KnowledgeMiner can create two types of models automatically:
- parametric regression models using GMDH (all versions) and
- nonparametric models (patterns) using Analog Complexing (KM pro only).
Parametric Regression models
GMDH, as a statistical learning network technology, subdivides a complex modeling problem into smaller, easily manageable problems (a network of so-called neurons) and solves these smaller problems by advanced statistical methods (statistical learning). To make this process highly automated (learning unknown relationships between variables) and to ensure reliable models, several cybernetic principles are applied as well. Statistical learning and the implementation of important cybernetic principles are what make GMDH superior to Neural Networks, while the network concept and the cybernetic principles in turn are what make it superior to statistics. However, the result of all three modeling methods is a statistical regression model. Therefore, we are, for the moment, in the domain of statistics.
Again, we must distinguish between
- static and dynamic models,
- time series and input-output models and
- linear and nonlinear models.
A static model describes static relationships (no dependence on time) between input variables Xm and an output variable Y in the form:
Y=f(Xm) , m=1, 2, ..., p.
For example, the relation
PROFIT = TURNOVER - COSTS
is a static model of an economical system.
Dynamic models are used to model and predict the dynamic behavior of a time process, i.e., the evolution of a variable over time t. Therefore, lagged samples of the output variable Y(t-n) and/or lagged samples of the input variables Xm(t-n) are additionally considered:
Y(t) = f(Y(t-1), ..., Y(t-n), Xm(t), Xm(t-1), ..., Xm(t-n)),
where t reflects a specific time horizon (day, week, month etc.).
For example, the relation
PROFIT(t) = 1.1 * PROFIT(t-1)
is a simple dynamic model to describe the evolution of the profit.
A time series model (or auto-regressive model) is intended to reflect the evolution of a variable over time mathematically by looking exclusively at historical data of the variable itself. It is a model in which the output variable Y(t) is described only by its lagged samples, Y(t)=f(Y(t-n)), where f is the unknown transfer function and n is the chosen maximum time lag (n > 0). Since real-world processes are influenced by several variables, time series models have only limited significance for modeling, prediction, and analysis of complex processes. Time series models are always dynamic models.
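As an illustration of the auto-regressive idea (not KnowledgeMiner's own algorithm), a linear model Y(t)=f(Y(t-n)) can be estimated by ordinary least squares. The data and the maximum lag below are invented for the example:

```python
import numpy as np

def fit_ar(y, n):
    """Estimate weights of a linear AR(n) model Y(t) = a0 + a1*Y(t-1) + ... + an*Y(t-n)."""
    # Each design row holds the lagged samples Y(t-1), ..., Y(t-n)
    rows = [y[t - n:t][::-1] for t in range(n, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    target = y[n:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

# Example: a series that exactly follows PROFIT(t) = 1.1 * PROFIT(t-1)
profit = np.array([100 * 1.1 ** t for t in range(12)])
coeffs = fit_ar(profit, n=1)
print(coeffs)  # intercept ~0, slope ~1.1
```

With noiseless data the least-squares fit recovers the generating weights exactly; on real, noisy series the estimated weights only approximate them.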
Input-output models can be either static or dynamic. The output variable Y is described by a number of different input variables Xm:
Y=f(Xm) , m=1, 2, ..., p.
In this way it is possible to find a relationship between a number of potential input variables and an output variable (multivariate regression analysis).
In KnowledgeMiner, linear and nonlinear models are parametric, polynomial models. The output variable Y is described by a polynomial function of one or more input variables Xm (m >= 1):
Y=f(X1, X2, ..., Xm).
In linear models only the first order polynomials of the input variables are considered. The linear model of two input variables would look like this:
Y=a + b*X1 + c*X2.
a, b, and c are the model parameters or weights.
Nonlinear models additionally contain at least one higher-order polynomial term. A nonlinear model of two input variables may look like this:
Y=a + b*X1 + c*X1*X2 + d*X2*X2.
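To make the parametric idea concrete, here is a minimal sketch (using NumPy, not KnowledgeMiner itself) that estimates the weights of exactly this nonlinear polynomial structure by least squares. The data and the "true" weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(0, 1, 50)
# Generate data from known weights so the fit can be checked
y = 2.0 + 0.5 * x1 + 1.5 * x1 * x2 - 0.8 * x2 * x2

# Design matrix holding the polynomial terms of the model structure
# Y = a + b*X1 + c*X1*X2 + d*X2*X2
X = np.column_stack([np.ones_like(x1), x1, x1 * x2, x2 * x2])
a, b, c, d = np.linalg.lstsq(X, y, rcond=None)[0]
print(round(a, 3), round(b, 3), round(c, 3), round(d, 3))  # ≈ 2.0 0.5 1.5 -0.8
```

In GMDH the structure itself (which terms to include) is found automatically; this sketch only shows the parameter estimation step for one fixed structure.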
Nonparametric models
Nonparametric models are selected by Analog Complexing from a given set of observations of variables; they represent one or more patterns of a trajectory of past behavior that are analogous to a chosen reference pattern. They are called nonparametric since there is no need to estimate any parameters. Analog Complexing is based on the assumption that there are typical situations of a specific process, i.e., each actual period of state evolution of a given multidimensional time process may have one or more analogues in history. If so, it is likely that a prediction can be obtained by transforming the known continuations of the historical analogues. It is essential that the search for analogous patterns is processed not only on a single state variable (time series) but on a set of representative variables simultaneously and objectively. These nonparametric models can be used to model and predict even the fuzziest time processes, such as the evolution of markets. Applied to finance, for example, Analog Complexing can be viewed as a kind of multidimensional, automated, and objectively working chart analysis (see Paper2 on the web site for a picture).
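A minimal sketch of the Analog Complexing idea: slide a window over the multidimensional history and rank past patterns by their similarity to the most recent (reference) pattern. The window length, distance measure, and data below are illustrative assumptions, not KnowledgeMiner's actual algorithm:

```python
import numpy as np

def find_analogues(series, window, k=3):
    """series: (T, vars) array; return start indices of the k patterns closest
    to the most recent window, searched over all variables simultaneously."""
    T = len(series)
    reference = series[T - window:]              # most recent pattern
    candidates = range(T - 2 * window)           # leave room for a known continuation
    dist = [np.linalg.norm(series[s:s + window] - reference) for s in candidates]
    return sorted(candidates, key=lambda s: dist[s])[:k]

rng = np.random.default_rng(1)
data = np.cumsum(rng.normal(size=(200, 3)), axis=0)   # 3 synthetic state variables
analogues = find_analogues(data, window=10)
# The known continuations data[s+window : s+window+h] of the analogues could
# then be combined (e.g., averaged) into a forecast.
print(analogues)
```

Note that no parameters are estimated here: the "model" is simply the set of selected historical patterns, which is what makes the approach nonparametric.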
Q: In the program I found the term 'system of equations'. What does it mean and when do I have to create a system?
A: A system of equations consists of multiple, interconnected input-output models. Therefore, it is sometimes called a multi-input/multi-output system. It is possible to create a static system (consisting of static input-output models only) or a dynamic system of equations. For example, a dynamic system of 3 variables may look like this:
X1(t)=f1(X2(t), X3(t), X1(t-1), X2(t-3), X3(t-5))
X2(t)=f2(X1(t), X3(t), X2(t-3), X1(t-5))
X3(t)=f3(X1(t), X2(t), X3(t-1), X1(t-3)) .
If we want not only to model a system but also to predict it, we must satisfy an additional important condition: the system has to be conflict-free. In the system above, each output variable depends on all the others at time t. This system is not free of conflicts and is therefore not applicable for prediction.
KnowledgeMiner automatically avoids such conflicts during the modeling process and creates a system which we call a 'predictable system'. The reason for creating dynamic systems of equations is that, to predict an output variable, values of its input variables must also be available for the forecast horizon. These values can be estimated or assumed (which leads to a what-if prediction), or they can themselves be predicted by time series models or, as we suggest, by a system of equations (leading to a status-quo prediction).
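A status-quo prediction with a conflict-free dynamic system can be sketched as follows: because each output depends only on lagged values, the system can be iterated forward without needing any future inputs. The two equations and their coefficients are invented for illustration:

```python
def step(x1_prev, x2_prev):
    """One step of a conflict-free dynamic system of two equations."""
    x1 = 0.9 * x1_prev + 0.1 * x2_prev      # X1(t) = f1(X1(t-1), X2(t-1))
    x2 = 0.2 * x1_prev + 0.7 * x2_prev      # X2(t) = f2(X1(t-1), X2(t-1))
    return x1, x2

state = (100.0, 50.0)                        # last observed values of X1, X2
forecast = []
for _ in range(5):                           # 5-step-ahead status-quo prediction
    state = step(*state)
    forecast.append(state)
print(forecast)
```

If an equation instead depended on another output at time t, the iteration would need a value that has not been computed yet; that is the conflict the text describes.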
Static systems of equations can be used for analysis, classification or diagnosis tasks.
Q: What is the information basis? Is it simply another word for the data basis stored in the table?
A: For dynamic models the information basis is a superset of the data basis stored in the table; for static models the two are equal. The information basis is used as the data source for modeling and contains all chosen unlagged and lagged variables. Usually, in other modeling programs, the information basis must exist physically in a table, i.e., each selected (unlagged or lagged) input variable must be stored in a separate table column. There, information basis and data basis must be equal. Lagged input variables, as derivatives of the unlagged ones, must be constructed manually by copying a column and then shifting it one or more rows down. Once this is done for each lagged variable, the whole table must be shifted again, this time upwards. The result of this time- and resource-intensive process is a large, fixed, and highly redundant table serving as the information basis for just this one specific modeling problem. For example, if you want to create a model of 5 input variables and their lagged samples up to a lag of 5, the information basis will consist of 30 input variables altogether (the 5 unlagged variables and their 25 lagged samples). You would have to extend the table column by column 25 times to get the desired information basis of 30 columns. If you decide to consider an additional variable or to create a model for another output variable, you have to store the complete first table and then reorganize it according to the new modeling problem in the way described above. Each model needs its own physical information basis. This is how it works on DOS or Windows machines.
We have asked ourselves why the information basis (and finally the user) has to meet the program's needs and not, conversely, why the program cannot adapt to the table's state. We think the user must be able, at any time and for any continuous or discontinuous core data basis, to create models from any possible input variable combination using the data basis as is. This means KnowledgeMiner creates an information basis for you automatically and virtually, using the data physically stored in the table. In this way it is theoretically possible to create an infinite number of models using one and the same data basis without any change. You only have to select the variables in the table you want to use for modeling in a defined way. Using the example above, you would have to select the first 6 rows of each column that stores the data of one of the 5 variables you want to use (e.g., columns 1, 4, 12, 13, and 25). You don't have to care about the rest. If you decide to create an alternative model using a 6th variable as well, you simply have to select 6 additional cells in the table and then let the computer do the rest. Both models will work on the same old data basis although they use different information bases.
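The idea of a virtual information basis can be sketched as follows (an illustrative reimplementation, not KnowledgeMiner's code): the lagged samples are derived on the fly from the selected table columns instead of being stored physically as extra columns:

```python
import numpy as np

def information_basis(table, columns, max_lag):
    """table: (rows, cols) array (the data basis).
    Returns a (rows - max_lag, len(columns) * (max_lag + 1)) design matrix:
    each selected column together with its lagged samples up to max_lag."""
    blocks = []
    for c in columns:
        for lag in range(max_lag + 1):
            # row t of the result holds X_c(t), X_c(t-1), ..., X_c(t-max_lag)
            blocks.append(table[max_lag - lag: len(table) - lag, c])
    return np.column_stack(blocks)

data = np.arange(50.0).reshape(10, 5)        # data basis: 10 rows, 5 variables
basis = information_basis(data, columns=[0, 1, 2, 3, 4], max_lag=5)
print(basis.shape)  # (5, 30): 5 unlagged variables plus their 25 lagged samples
```

The data basis itself is never changed; choosing a different column set or maximum lag simply yields a different virtual information basis from the same table.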
Choosing the information basis is the user's most important task before creating a model. Generally, you have to answer the question 'By which input variables could my output variable be affected?'. Choosing the information basis is only an assumption that some dependence could exist; it does not predefine the final model structure. It only defines the set of variables from which a subset of really relevant variables will be used in the final model (knowledge extraction).
Q: I was impressed by your description of KnowledgeMiner. Actually it seems to find optimal neural nets - a task I thought was more or less impossible. I want to use it in linguistics - I planned to take a vector representation of a word and map that into a representation of another form of that word. I just have one question.
* Does KM assume that there is a one-to-one or many-to-one relation in
the data or can it support one-to-many mappings as well?
A: I will try to answer your question, but I'm not sure if I understand it correctly.
For each variable (column) you can create and store
- a time series model (auto-regressive model) in the form
Yt = f(Yt-1, Yt-2, ..., Yt-n) at a given time t;
- a multi-input/single output model in the form
Y = f(X1, X2, ..., Xm)
where Y is the output variable vector, Xm are the input variable vectors, and f is the unknown relation between these variables. The task of KnowledgeMiner is to create a large number of alternative relations f of increasing complexity and to select the best, valid relation f according to a given criterion. Or, more precisely, the three best models are selected and stored in the model base.
The Xm's can be observational data, time-lagged samples of observed variables, or functions of Xm (e.g., sin(x), exp(x)). Y=f(X1,X2,...,Xm) can be either a static model or a dynamic model if time-lagged samples are used; the model then reflects a difference equation.
- a multi-input/single output model as part of a multi-input/multi-output system (system of
equations) which may have the form
X1 = f1(X2, ..., Xm)
X2 = f2(X1, X3, ..., Xm)
...
Xm = fm(X1, X2, ..., Xm-1)
For a dynamic system we obtain an autonomous system of m difference equations, which is applicable for status-quo prediction. Such a system reflects the interconnections between all variables and is made visible through a system graph.
Q: Would you consider your products suitable for financial and product-demand forecasting using numerous variable inputs?
A: Yes, this is one of the primary application fields of KnowledgeMiner. In contrast to statistics or NN's, you can use more variables than there are samples available. For example, you can create a prediction model (e.g., a linear system of equations) considering 40 variables but only 30 observations for each variable. You can consider up to 500 input variables (time-lagged and unlagged) in KnowledgeMiner pro to model complex time processes. Additionally, KnowledgeMiner pro implements the Analog Complexing method as a second, extremely powerful prediction technique for fuzzy processes like financial markets. KnowledgeMiner when used on financial markets could really strike gold!
Q: When you say KnowledgeMiner Lite is limited to 50 inputs, does that mean 50 rows of data? For example, assume I wanted to model the results of 5 variables expressed as columns; can I have 50 rows of data with 5 variables in each row, or am I limited to 10 rows of data (10 rows times 5 variables equals 50 variables or inputs)? I am really confused about the limitation of 50 inputs.
A: You are not limited to 50 rows. '50 inputs' means 50 input variables. For static models, which contain no time-lagged variables, this equals 50 columns. For dynamic models used to reflect time processes, it includes the time-lagged variables as well. So the maximum data basis you can use in the Lite version is 50 columns times 300 rows = 15,000 data points. The Lite version is adequate for the example you mentioned.
The maximum data basis for the pro version is 500 input variables times 1000 rows. This is comfortable for almost all problems, since KnowledgeMiner doesn't need nearly as many cases as NN's or statistics do.
We can create custom versions to meet almost any modeling need.
Q: How many rows (subjects) are necessary per column (variable) to assure the validity of the model? Of course, due to the limitation of the demo version (a limit on the rows taken into account), the fitted values were perfect, but the predicted values of the model were worse on my personal data.
A: Generally, there is no rule on how many cases (rows) are necessary to model a specific object appropriately. The reason is that, in any case, we can deal only with a finite amount of data, while the real world is infinite.
The modeling method implemented in KnowledgeMiner was designed to work on small and noisy data sets and to find a model that reflects significant relationships between variables. But the more an object can vary (the more significantly different cases or subjects it can have), the more data (rows) one should use for training a model. Additionally, nonlinear models should be created on a larger data set than linear models. Experience has shown that linear models created on 10 rows of data can already perform very well on new data. Still, it's better to use 20 or 30 rows as a lower limit. For nonlinear models, about 50 rows are advisable as a lower limit, but this is not a must.
Q: The possibility of time lag models is really interesting too. In human training studies, the number of measures per year is very low (2-6) compared to the number of testing variables (10-20). How many subjects are necessary here?
A: The same is true if you want to create a dynamic model. In contrast to statistics or Neural Networks, KnowledgeMiner can deal with a smaller number of cases than variables (so-called under-determined tasks). So it is really possible for you to use 20 variables and only 12 or 18 rows for the creation of a linear system of equations.
Q: I have a one-time problem involving the price analysis of about 200,000 vehicles, and the factors that influence price. I have identified 17 descriptors (factors that may influence price). I can probably eliminate 4 to 6 of these factors if necessary.
A: 17 descriptors are OK; they are no problem. But it may not remain a one-time problem, since the pricing model can change over time. Using KnowledgeMiner you can always have up-to-date models.
The Pro versions of KnowledgeMiner can handle up to 500 input variables and 1000 rows for each variable. Special custom versions can be created for $20 per input (or per 100 rows), in steps of 50 inputs or 1000 rows.
Q: What kind of RAM is needed to use KnowledgeMiner?
A: Modeling larger problems requires a lot of memory. Therefore KnowledgeMiner temporarily uses the free RAM on your computer for memory-intensive tasks. The modeling process is memory intensive because it works like natural evolution: a given generation of individuals (input variables) creates a new generation of individuals by all possible pairwise combinations, which in turn - after selection of the fittest individuals - create yet another generation that fits the desired behavior better than the previous one, and so on. This processing requires that all individuals of a generation live in the computer at the same time so that they can be compared and selected. The basic rule for this relation is: the more inputs used, the more individuals must be created and the more memory is needed (a nonlinear relationship).
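The nonlinear memory growth follows directly from the pairwise combination step: a layer built from n surviving individuals must hold all n*(n-1)/2 candidate neurons at once. A quick sketch of the count (the layer mechanics are simplified; KnowledgeMiner's internals are not shown here):

```python
def candidates_per_layer(n_inputs):
    """Number of candidate neurons created by all pairwise combinations
    of n_inputs individuals in one GMDH layer."""
    return n_inputs * (n_inputs - 1) // 2

for n in (10, 50, 200, 500):
    print(n, "inputs ->", candidates_per_layer(n), "candidate models in one layer")
```

So going from 50 to 500 inputs multiplies the inputs by 10 but the per-layer candidate count by roughly 100, which is why large tables need so much RAM.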
Q: What would be the largest table (columns and rows)
KnowledgeMiner could accomodate if allocated 100 MB of RAM?
A: The table below contains approximate values as an orientation (rows vs. maximum number of inputs):

 rows   inputs
  100      500
  200      350
  300      280
  400      240
  500      210
 1000      150
 2000      110
 5000       70
Q: Is there a limit on the number of layers subjected to analysis?
A: Yes, there is a limit, but so far it has not been relevant in practice, even for large systems (200 and more inputs). The process always stops automatically (when an optimally complex model has been found) before that physical limit is reached.
Q: Can KnowledgeMiner read SAS files?
A: KnowledgeMiner cannot read SAS files. But SAS can save data as ASCII files, which can be read by KnowledgeMiner; alternatively, import via copy and paste. We recommend splitting the very large databases needed for statistics (tens of thousands of rows or so) into smaller problems, which is almost always possible. Large data sets are often highly redundant: a much smaller data set can contain the same information as a much larger one. By the way, downsizing large problems is also a principle KnowledgeMiner is built on.
Q: One of the difficulties with Statistical Pattern Recognition in my application is that one might not get a sufficiently sophisticated classifier to give the best possible results (for example, using a linear classifier instead of a more complex quadratic classifier). It appears that KM does not suffer from this problem because it appears to produce a bona fide nonlinear equation which should optimally accommodate any irregular shaping of the class populations in feature space. Is this true?
A: Yes, you are correct. One important feature of KnowledgeMiner is that it creates models in an evolutionary way: from very simple models to increasingly more complex ones. It stops automatically when an optimally complex model has been found, that is, when the model begins to overfit the design data (the data used to create relationships between variables).
Q: It "feels" like KnowledgeMiner might assist in detecting relationships among certain patient groups by clinical criteria vs. fluid measurements that may be missed by an individual. If I understand the application of KnowledgeMiner, I believe I should be able to take our database of eye features along with the diopter measurements for each patient in the database, plug those into KnowledgeMiner and then KnowledgeMiner will derive an equation for calculating the diopter measurement of a patient as a function of the patient's image features. Is this true?
A: Yes, exactly. This is something KnowledgeMiner can do.
Q: Are there statistical tests of significance that can be applied to derived model coefficients, i.e., a t-test or an F-test? I ask about the latter because one of the major criticisms I've heard levied against neural network models is that there are no statistical tests that can be applied against the derived model. Hence, it is sort of like "eyeballing" a regression line and having no way to determine if my "eyeballed" model is any better than your "eyeballed" model.
A: Your question focuses on the problem of which criterion/method is used to select significant variables and their parameters (weights), and how valid the resulting model will be.
Generally, this is one of the most difficult questions for all methods which intend to solve a black-box problem: statistics, Neural Networks (NN's), Genetic Algorithms, GMDH etc. This is particularly true for short and noisy data samples.
In statistics, tests for significance (t-test, F-test) are used to get information on the relevance of parameters. However, these tests have two main disadvantages:
1. The smaller and the noisier the data sample, the more randomness will be captured by the model. The result is a so-called overfitted model.
2. They presuppose that the number of potential input variables is equal to the number of variables actually used in the model. Usually, however, one is interested in finding only a subset of the potential input variables - the most relevant ones - for model construction. Due to this contradiction, the resulting models tend to be over-complicated and more or less unstable. Again, they will overfit the data used to create the model (the design data).
In other words, the more potential input variables are considered a priori for modeling (which can even be random values), the better the model "quality" will be according to these significance tests. This so-called artificial skill is dangerous, since the risk that the model reflects only random relationships increases with the number of potential input variables.
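The artificial-skill effect is easy to demonstrate numerically (a hedged, synthetic demonstration, unrelated to KnowledgeMiner's code): adding purely random input variables never worsens, and typically shrinks, the fit error on the design data, even though the extra variables carry no information at all:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
y = rng.normal(size=n)                 # target with no real structure at all

def design_fit_error(n_random_inputs):
    """Least-squares fit of y on an intercept plus purely random inputs;
    returns the sum of squared residuals on the design data."""
    X = np.column_stack(
        [np.ones(n)] + [rng.normal(size=n) for _ in range(n_random_inputs)]
    )
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ coeffs - y) ** 2))

errors = [design_fit_error(k) for k in (1, 5, 15, 25)]
print(errors)  # the design-data fit error shrinks as random inputs are added
```

Any significance test judged only on the design data would reward the larger model here, which is exactly the danger the text describes.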
For Neural Networks the problem is similar. However, NN's try to perform a global search on a high-dimensional and multimodal surface to find an optimal model. This approach is much more risky, much more sensitive, and in most cases less effective than advanced statistics.
However, most important is that NN's show the same fatal relationship as statistical modeling methods: the more complex the model, the better its "quality" (or, the smaller its error) will be. This means one can always get a small model error on the design data by making the model more complex (and thus overfitting it), e.g., by adding more hidden neurons or an additional hidden layer to the network. The performance on new data, however, can be catastrophic due to the model's reflection of randomness.
The dilemma of statistics and NN's is that there is no rule for finding a model structure/complexity that is optimal for a given task: if a model is too simple, it performs badly; if it is too complex, it also performs badly on new data.
To avoid overfitted models, an additional principle is needed that takes model complexity into account. This task is equivalent to the problem of finding an optimal model structure or, in NN jargon, an optimal network topology. Such a principle, which we call induction, is implemented in GMDH:
- the cybernetic principle of self-organization as the adaptive creation of a
network without subjectively given settings;
- the principle of external complement (external criterion), enabling an objective
selection of a model of optimal complexity; and
- the principle of regularization of ill-posed tasks.
With regard to your question, the second item, the external criterion, is of special importance. An external criterion values the model quality not only by how well it fits the design data but additionally by how well it performs on data not yet used for model creation (testing or validation data). With such a criterion, the relation describing model quality as a function of model complexity (structure) is unimodal, i.e., it has a defined maximum.
A data set can be divided into design and validation data either explicitly, using one of several splitting rules, or implicitly and dynamically. KnowledgeMiner uses the latter method by applying the cross-validation principle: each time a new model structure has been estimated, this intermediate model is cross-validated against unseen data. The performance result is used to decide whether the new structure increases the model's quality or not (Active Neuron). Additionally, since not just one model but a whole set of models in a layer (population, generation) exists at a time, the selection process, performed using a second, auxiliary criterion, provides another way to avoid overfitting.
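The external-criterion idea can be sketched as follows (an illustrative simplification: an explicit train/test split and polynomial degree as the complexity measure, not KnowledgeMiner's dynamic cross-validation): a candidate structure is judged by its error on data not used for fitting:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=40)   # truly linear + noise

train, test = slice(0, 30), slice(30, 40)            # design vs. validation data

def errors(degree):
    """Fit a polynomial of the given degree on the design data only;
    return (design error, external/validation error)."""
    coeffs = np.polyfit(x[train], y[train], degree)
    fit = np.polyval(coeffs, x)
    e = (fit - y) ** 2
    return e[train].mean(), e[test].mean()

best = None
for degree in range(1, 8):
    design_err, external_err = errors(degree)
    # The design error can only shrink with complexity; the external
    # criterion is what makes the quality-vs-complexity relation unimodal.
    if best is None or external_err < best[1]:
        best = (degree, external_err)
print("selected complexity:", best[0])
```

The design error alone would always select the most complex candidate; the external criterion is what stops the selection at a sensible complexity.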
Fortunately, the user doesn't have to care about all this, because the whole model creation, validation, and selection process is performed automatically by KnowledgeMiner to find an optimally complex and stable model reflecting relevant relationships.
However, since modeling is in any case processed on a finite amount of data, models will always retain some uncertainty. So even if the results of different models are comparable, it may be quite important what effort was necessary to obtain them.
In contrast to NN's, KnowledgeMiner provides the generated models in analytical form. Therefore it is possible to test the parameters for significance afterwards, if desired, or to re-estimate the parameters on the created optimal model structure.
Q: I agree with this, but a real problem arises with smaller data sets that simply cannot be divided into even smaller samples, or when the original data includes strata but some of the strata have too few observations to lend themselves to being further divided. For example, I recently worked with a dataset of only 26 observations. Although marginally adequate for regression analysis (in this case with only two explanatory variables, or we would have other problems!), it would have been difficult at best to justify dividing the sample further into two subsamples of only 13 observations each. Of course this further ignores other questions regarding the true randomness of the sample and having a sample size adequate to reflect the underlying population distribution.
How does your approach deal with something like the above? Can your approach provide reliable modeling with such small sample sizes?
A: First, you don't have to subdivide the data explicitly. Explicit subdivision of the data sample is what NN's do, meanwhile. More powerful is dividing the data dynamically and internally during modeling (a kind of sliding window), as performed by KnowledgeMiner using cross-validation. In this way virtually all data are used as training as well as testing data, but not at the same time.
Your example describes a typical application scenario for GMDH where statistics and NN's must give up. In the freely downloadable KnowledgeMiner package you can find examples of that order of magnitude: a model of the US economy using 29 quarterly observations for 21 output variables, a model of the German economy using 28 yearly observations for 13 variables, and a balance sheet prediction model using only 7(!) yearly observations for 13 variables. This is the extreme.
Generally, KnowledgeMiner is applicable to under-determined tasks. For the US economy, e.g., KnowledgeMiner automatically creates a linear system of 21 difference equations using 103 potential input variables (20 unlagged and 83 lagged variables for a time lag of up to 4) on 25 observations. The information matrix containing the lagged and unlagged samples was also created automatically.
Q: The method for defining inputs and outputs in KnowledgeMiner is a bit arcane. It is the least user-friendly part of the program, and also the part most likely to give a user problems and to cause errors. I think I see why this approach was taken (the spreadsheet approach is nice and very flexible), but it's just a bit too hard for a program that is otherwise so easy.
A: Our experience has shown that
- first-time users have problems with the variable selection technique since it is new and not intuitive enough;
- once a user has understood the way to select variables (you can use the Selection Mask On/Off menu item in the table menu to get some visual assistance), it is found to be the most flexible, fastest, and easiest method (the minimum number of steps necessary). Any discontinuous selection is possible and accepted. This is important for selecting only certain variables or time lags. It is very easy to click into the corresponding cells to make any possible variable selection once one knows how to interpret these selections. So it may be a problem of explanation rather than a problem of handling. A priori false selections are also ignored by the program.
We are open to all suggestions to improve the interface and operation of KnowledgeMiner. Please continue to experiment and get back to us with your recommendations.